Statistics

From the Visualization step and onwards, statistics can be displayed by clicking on the Statistics button on the main toolbar.

NOTE: All statistics are calculated and displayed on the raw data imported from the source. No limits are taken into account when calculating statistics.

Calculated Statistics and Columns

NOTE: Statistics may be omitted and displayed as blank fields for fields where all data is marked as bad quality or where the field has zero variance.

Data Type: The Data Type column displays whether a particular field is a Continuous (numeric) variable or a Discrete (e.g. text) variable.
Rows: The number of rows displayed in this column depends on the combination of Data Filters selected. If the filters are set to Ignore and All data, the number displayed is the number of GOOD QUALITY rows of data for each field that was originally loaded. See the Data Filters section for more details on the filtering options available.

NOTE: The total number of rows listed below the statistics view is the number of rows of data that make up the original data source. E.g. "Using 695 of 696 rows in dataset" implies that of the 696 rows of data in the original dataset, 695 are marked as GOOD QUALITY and will therefore be used for modeling purposes.
Minimum: Numerically the smallest value of the field.
Maximum: Numerically the largest value of the field.
Average: The average value of the field. This is calculated by adding all the values of the field and dividing the sum by the number of values.
Variance: A measure of statistical dispersion: the average squared distance of the field values from the expected value, which is the mean value.
Std Dev: The standard deviation of the field - a measure of dispersion of the field values, equal to the square root of the variance.
Skew: A measure of the asymmetry of the distribution of the field values. A distribution has a positive skew (right-skewed) if the right (higher value) tail is longer or fatter, and a negative skew (left-skewed) if the left (lower value) tail is longer or fatter.
Kurtosis: A measure of whether the data distribution is peaked or flat, relative to a normal bell curve distribution.

CSense calculates Excess Kurtosis, where an excess kurtosis of 0 is commonly interpreted as a normal distribution.

A distribution with a positive excess kurtosis (excess kurtosis greater than 0) has a more acute peak (a higher probability than a normally distributed variable of values near the mean) and fat tails (that is, a higher probability than a normally distributed variable of extreme values).

A distribution with a negative excess kurtosis (excess kurtosis less than 0) has a flatter peak (a lower probability than a normally distributed variable of values near the mean) and thin tails (a lower probability than a normally distributed variable of extreme values).
Unique: The number of unique values of the field.
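
The definitions above can be illustrated with a short sketch in Python. This is not CSense's implementation: it assumes the population form of the moments (dividing by the number of values), which this section does not specify, and the function name `describe` and the dictionary keys are hypothetical. Returning no value for skew and kurtosis when the variance is zero loosely mirrors the note about blank fields.

```python
import math

def describe(values):
    """Summary statistics for a non-empty list of good-quality numeric values."""
    n = len(values)
    mean = sum(values) / n
    # Central moments in population form (divide by n): an assumption,
    # since the text does not say whether n or n - 1 is used.
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    stats = {
        "rows": n,
        "minimum": min(values),
        "maximum": max(values),
        "average": mean,
        "variance": m2,
        "std_dev": math.sqrt(m2),
        "unique": len(set(values)),
    }
    if m2 > 0:
        stats["skew"] = m3 / m2 ** 1.5
        # Excess kurtosis: subtract 3 so a normal distribution scores 0.
        stats["excess_kurtosis"] = m4 / m2 ** 2 - 3.0
    else:
        # Zero variance: skew and kurtosis are undefined.
        stats["skew"] = None
        stats["excess_kurtosis"] = None
    return stats

stats = describe([2, 4, 4, 4, 5, 5, 7, 9])
```

For this sample the average is 5, the variance is 4, and the standard deviation is 2. The skew is positive (about 0.656) because the higher-value tail is longer, and the excess kurtosis is slightly negative (about -0.219).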

Data Filters

Limits filter
- Ignore
  
  This option calculates the statistics ignoring any limits that may have been applied in the histogram views.

- Apply individually
  
  This option calculates the statistics on a specified field using only the GOOD QUALITY data between the Low Low and High High limits of that specific field. The Rows column will display the number of rows of data that fall within this subset.

- Apply globally
  
  If any outer limits have been set in the histogram views, then those rows of data that have been excluded from the fields in question will now be excluded from the rest of the dataset. In other words, if rows of data were marked as BAD QUALITY for a specific field, then that entire row (for all the other fields) will be marked as BAD and excluded from the statistics calculations.

Brush filter
- All Data
  
  This option calculates the statistics for ALL data (brushed and non-brushed data).

- Inside brushed data
  
  This option calculates the statistics for only the brushed data.

- Outside brushed data
  
  This option calculates the statistics for only the non-brushed data.
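
The three Brush filter options amount to selecting rows by a per-row "brushed" flag before the statistics are calculated. The sketch below illustrates that idea in Python. The function name `brush_filter`, the mode strings, and the representation of brushing as a list of booleans are all assumptions made for illustration, not part of the product.

```python
def brush_filter(rows, brushed, mode):
    """Return the rows selected by a Brush filter option.

    rows    : list of data rows (any values)
    brushed : list of booleans, True where the matching row is brushed
    mode    : "all", "inside" or "outside" (hypothetical names)
    """
    if len(rows) != len(brushed):
        raise ValueError("rows and brushed must have the same length")
    if mode == "all":
        # All Data: brushed and non-brushed rows alike.
        return list(rows)
    if mode == "inside":
        # Inside brushed data: only rows that are brushed.
        return [r for r, b in zip(rows, brushed) if b]
    if mode == "outside":
        # Outside brushed data: only rows that are not brushed.
        return [r for r, b in zip(rows, brushed) if not b]
    raise ValueError(f"unknown mode: {mode!r}")

values = [10.0, 12.5, 11.0, 30.0, 9.5]
brushed = [False, True, True, False, False]
inside = brush_filter(values, brushed, "inside")
outside = brush_filter(values, brushed, "outside")
```

Here `inside` holds the two brushed values and `outside` the remaining three. Together they account for every row returned by the "all" mode.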

An example demonstrating the use of the filters

Row Number	A_Quality	B_Quality	C_Quality
1	1	1	1
2	1	1	1
3	1	1	1
4	1	0	1
5	1	1	1
6	1	1	1
7	1	1	0
8	1	1	1
9	1	1	1
10	1	1	1

Results from the above example:

The number of rows listed will read: Using 8 of 10 rows in dataset.
The Rows column will read: 8 of 10 when Ignore and All data are selected from the filters.
If NO outer limits are set in the histogram views, then the Rows column will read: 8 when either Apply individually or Apply globally is selected.
If the outer limits on A were to be adjusted, e.g. resulting in the removal of rows 1 and 10 from A, then selecting Ignore and All data will result in the Rows column showing 8 of 10.
If Apply individually is selected then Rows will show 6 for field A, and 8 for fields B and C.
If Apply globally is selected then Rows will show 6 for all of the fields.
Selecting Inside brushed data (provided brushing has been done) will result in Rows displaying the number of good quality rows of data that fall within the brushed area(s).
Selecting Outside brushed data (provided brushing has been done) will result in Rows displaying the number of good quality rows of data that fall outside the brushed area(s).
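
The row counts in this example can be reproduced with a small Python sketch. It encodes one reading of the results above, stated here as an assumption: a row counts as usable only when every field in it is of good quality (giving 8 of 10), and the Limits filter then removes rows from that usable set. The names `usable_rows`, `rows_column`, and the mode strings are hypothetical.

```python
FIELDS = ("A", "B", "C")

# Quality flags from the example table (1 = good, 0 = bad), keyed by row number.
# Row 4 is bad for B and row 7 is bad for C; everything else is good.
bad = {4: "B", 7: "C"}
quality = {
    row: {f: 0 if bad.get(row) == f else 1 for f in FIELDS}
    for row in range(1, 11)
}

# Rows removed from each field by its outer limits (rows 1 and 10 for A).
outside_limits = {"A": {1, 10}, "B": set(), "C": set()}

def usable_rows(quality):
    """Rows in which every field is of good quality (assumed reading)."""
    return {row for row, flags in quality.items() if all(flags.values())}

def rows_column(quality, outside_limits, mode):
    """Per-field row counts for a Limits filter option."""
    base = usable_rows(quality)
    if mode == "ignore":
        # Limits are not taken into account.
        return {f: len(base) for f in outside_limits}
    if mode == "individually":
        # Each field drops only the rows outside its own limits.
        return {f: len(base - out) for f, out in outside_limits.items()}
    if mode == "globally":
        # A row outside the limits of any field is dropped for all fields.
        excluded = set().union(*outside_limits.values())
        return {f: len(base - excluded) for f in outside_limits}
    raise ValueError(f"unknown mode: {mode!r}")

total = len(quality)
good = len(usable_rows(quality))
```

Under these assumptions `good` is 8 and `total` is 10. Calling `rows_column` with "ignore" gives 8 for every field, "individually" gives 6 for A and 8 for B and C, and "globally" gives 6 for all fields, matching the results listed above.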

Export Option

This option exports the statistics view as a Comma Separated Value (CSV) file.

Below the list of statistics, the statistics view also displays the number of available good quality rows of data out of the total number of rows of data.